Random forest

Example data

The Boston housing-price dataset
medv: median value of owner-occupied homes in $1K

Code
library(MASS)
head(Boston)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
  medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7

Loading the required packages

Code
library(caret) # classification and regression training 
library(dplyr)
library(ggplot2)

Data splitting

Training vs. Test dataset
caret::createDataPartition()

Code
set.seed(1) # fix the random seed so the split is reproducible
train_index = createDataPartition(1:nrow(Boston), p = 0.75, list = FALSE) # returns an index matrix; passing the outcome (Boston$medv) instead would give a stratified split
train_boston = Boston[train_index,]
test_boston = Boston[-train_index,]
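As a sketch of what createDataPartition() does, here is the equivalent split in plain base R with sample() (the names train_b/test_b are illustrative; note that unlike createDataPartition() given an outcome vector, sample() does not stratify):

```r
library(MASS) # for the Boston data

set.seed(1)
n <- nrow(Boston)
idx <- sample(n, size = floor(0.75 * n)) # 75% of row indices, without replacement
train_b <- Boston[idx, ]   # training rows
test_b  <- Boston[-idx, ]  # remaining 25% held out for testing
c(train = nrow(train_b), test = nrow(test_b))
```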

Tuning parameters of random-forest model

  • ntree: number of trees (default: 500).
  • mtry: number of predictors randomly sampled as split candidates at each node (default: p/3 for regression, sqrt(p) for classification).
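The documented randomForest defaults for mtry can be computed directly; for the Boston data there are p = 13 predictors (all columns except medv):

```r
p <- 13            # number of predictors in the Boston data
floor(p / 3)       # default mtry for regression: 4
floor(sqrt(p))     # default mtry for classification: 3
```

The grid below (mtry = 2, 7, 13) brackets the regression default of 4 and includes the bagging case mtry = p = 13.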

Fit predictive models over different tuning parameters

Code
trctrl = trainControl(method = "repeatedcv", number = 10, repeats = 2) # 10-fold CV, repeated twice
tunegrid = expand.grid(.mtry = c(2,7,13))
rf_model = train(medv ~ ., data = train_boston, method = "rf", trControl = trctrl, tuneGrid = tunegrid, importance = TRUE)  
rf_model
Random Forest 

382 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 2 times) 
Summary of sample sizes: 344, 345, 343, 343, 344, 344, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE     
   2    3.536917  0.8790339  2.437131
   7    3.275412  0.8814833  2.242504
  13    3.404143  0.8691725  2.321389

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 7.

Model performance in training set

Code
plot(rf_model)

Model performance in test set (I)

Code
pred = predict(rf_model, newdata = test_boston)
df = data.frame(prd = pred, obs = test_boston$medv)
df %>% 
  ggplot(aes(obs, prd)) + 
  geom_point() + 
  geom_abline(intercept = 0, slope = 1)

Model performance in test set (II)

RMSE

Code
sqrt(mean((pred - test_boston$medv)^2))
[1] 2.908614
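The metrics caret reports can be reproduced by hand; on a toy pair of vectors (illustrative values):

```r
obs  <- c(1, 2, 3, 4)      # observed values
prd  <- c(1.5, 2, 2, 5)    # predictions
rmse <- sqrt(mean((prd - obs)^2)) # root mean squared error -> 0.75
mae  <- mean(abs(prd - obs))      # mean absolute error    -> 0.625
```

caret's postResample(pred, obs) returns RMSE, Rsquared and MAE in one call, which is a convenient cross-check against the manual formula above.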

Variable importance

Code
rf_imp = varImp(rf_model, scale = FALSE)
plot(rf_imp)
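The permutation idea behind importance = TRUE can be sketched in base R with a linear model as a stand-in: shuffle one predictor and see how much the fit degrades. This is a rough illustration only, not the %IncMSE that randomForest computes (which permutes within out-of-bag samples):

```r
library(MASS) # for the Boston data

fit <- lm(medv ~ ., data = Boston)
base_rmse <- sqrt(mean(residuals(fit)^2))

# RMSE after destroying one predictor's link to medv by permuting it
perm_rmse <- function(var) {
  shuffled <- Boston
  set.seed(42)
  shuffled[[var]] <- sample(shuffled[[var]])
  p <- predict(fit, newdata = shuffled)
  sqrt(mean((p - Boston$medv)^2))
}

perm_rmse("lstat") - base_rmse # large increase: lstat is important
perm_rmse("chas")  - base_rmse # small increase: chas matters little
```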

Gradient boosting

Code
set.seed(9)
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 10), # maximum tree depth
                       n.trees = 50 * (1:10),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
gbm_model = train(medv ~ ., data = train_boston, trControl = trctrl, method = "gbm", tuneGrid = gbmGrid, verbose = FALSE)
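To make the roles of n.trees, interaction.depth and shrinkage concrete, here is a minimal boosting loop in base R: depth-1 stumps are fit to the current residuals (for squared loss the residual is the negative gradient) and added with a small shrinkage weight. This is a toy sketch on simulated data, not what gbm does internally (gbm also subsamples rows, hence "stochastic"):

```r
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)

# Depth-1 "stump": the split point that best explains the residual r
fit_stump <- function(x, r) {
  best <- NULL
  best_sse <- Inf
  for (cut_pt in quantile(x, probs = seq(0.1, 0.9, by = 0.1))) {
    left  <- mean(r[x <= cut_pt])
    right <- mean(r[x >  cut_pt])
    sse   <- sum((r - ifelse(x <= cut_pt, left, right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(cut = cut_pt, left = left, right = right)
    }
  }
  best
}
predict_stump <- function(s, x) ifelse(x <= s$cut, s$left, s$right)

shrinkage <- 0.1                  # cf. the shrinkage tuning parameter
n_trees   <- 200                  # cf. n.trees
pred <- rep(mean(y), length(y))   # initial constant prediction
for (m in seq_len(n_trees)) {
  r    <- y - pred                # residuals = negative gradient of squared loss
  s    <- fit_stump(x, r)
  pred <- pred + shrinkage * predict_stump(s, x)
}
sqrt(mean((y - pred)^2))          # training RMSE shrinks as n_trees grows
```

Larger interaction.depth corresponds to replacing the stump with a deeper tree, letting each round capture interactions between predictors.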

Gradient boosting

Code
gbm_model
Stochastic Gradient Boosting 

382 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 2 times) 
Summary of sample sizes: 346, 344, 342, 342, 345, 343, ... 
Resampling results across tuning parameters:

  interaction.depth  n.trees  RMSE      Rsquared   MAE     
   1                  50      4.247584  0.8007864  2.977374
   1                 100      3.926869  0.8262090  2.742294
   1                 150      3.783172  0.8380866  2.657258
   1                 200      3.694597  0.8460769  2.615573
   1                 250      3.632247  0.8514102  2.603098
   1                 300      3.606230  0.8546869  2.594943
   1                 350      3.567493  0.8576379  2.587604
   1                 400      3.548092  0.8594899  2.591343
   1                 450      3.533058  0.8608667  2.570632
   1                 500      3.532273  0.8617898  2.581943
   5                  50      3.448365  0.8612565  2.359094
   5                 100      3.291913  0.8752094  2.248381
   5                 150      3.197842  0.8832918  2.200703
   5                 200      3.197905  0.8835831  2.213877
   5                 250      3.181930  0.8852066  2.200321
   5                 300      3.210283  0.8833751  2.210406
   5                 350      3.209951  0.8839370  2.210860
   5                 400      3.210492  0.8839794  2.212740
   5                 450      3.220521  0.8833170  2.217916
   5                 500      3.218274  0.8835087  2.219301
  10                  50      3.316865  0.8745845  2.251588
  10                 100      3.182085  0.8847252  2.157877
  10                 150      3.125601  0.8892863  2.123264
  10                 200      3.098105  0.8921043  2.126052
  10                 250      3.096366  0.8922769  2.120877
  10                 300      3.103717  0.8921056  2.130409
  10                 350      3.115558  0.8914627  2.135033
  10                 400      3.118766  0.8914652  2.138826
  10                 450      3.119457  0.8913968  2.138764
  10                 500      3.118411  0.8914066  2.140053

Tuning parameter 'shrinkage' was held constant at a value of 0.1

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 250, interaction.depth =
 10, shrinkage = 0.1 and n.minobsinnode = 10.

Model performance in training set

Code
plot(gbm_model)

Variable importance

Code
summary(gbm_model, las = 1)
            var     rel.inf
lstat     lstat 43.65029017
rm           rm 29.63374670
dis         dis  7.85736097
age         age  3.76535048
nox         nox  3.76433236
crim       crim  3.30882481
ptratio ptratio  2.36756617
black     black  2.33358696
tax         tax  1.85977227
indus     indus  0.72711734
rad         rad  0.54572978
zn           zn  0.09398419
chas       chas  0.09233780

Evaluate performance on the test dataset

Code
gbm_pred = predict(gbm_model, newdata = test_boston)
sqrt(mean((gbm_pred - test_boston$medv)^2)) # RMSE

Comparing RF and GBM performance

Code
resamps = resamples(list(RF = rf_model, 
               GBM = gbm_model))
resamps

Call:
resamples.default(x = list(RF = rf_model, GBM = gbm_model))

Models: RF, GBM 
Number of resamples: 20 
Performance metrics: MAE, RMSE, Rsquared 
Time estimates for: everything, final model fit 

Comparing RF and GBM performance

Code
summary(resamps)

Call:
summary.resamples(object = resamps)

Models: RF, GBM 
Number of resamples: 20 

MAE 
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
RF  1.817414 2.030482 2.127293 2.242504 2.399982 2.843888    0
GBM 1.346376 1.921460 2.173535 2.120877 2.296872 2.932426    0

RMSE 
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
RF  2.232549 2.782236 3.001310 3.275412 3.625193 5.480696    0
GBM 1.800942 2.543783 2.922687 3.096366 3.402958 5.250482    0

Rsquared 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
RF  0.7247324 0.8622467 0.8908405 0.8814833 0.9249318 0.9498064    0
GBM 0.6497052 0.8692037 0.9060029 0.8922769 0.9372896 0.9528183    0

Comparing RF and GBM performance

Code
bwplot(resamps)